Research

The dataset was obtained from https://docs.google.com/document/d/1qEcwltBMlRYZT-l699-71TzInWfk4W9q5rTCSvDVMpc/pub, where the red-wine dataset is the first in the list.
Before starting the project I searched for the optimal composition of red wine available in the market. I then plotted the distribution of the various variables in the Red Wine data set and compared them with the optimal composition I had found.
The report explores the Red Wine data set, which has 1599 entries and 13 variables.

Uni-Variate Plots Section

In this section, we analyse every variable in the dataset independent of the other variables.
Let us first look at the summary and structure of the Red Wine data set to get a clear picture of the variables it contains.

##        X          fixed.acidity   volatile.acidity  citric.acid   
##  Min.   :   1.0   Min.   : 4.60   Min.   :0.1200   Min.   :0.000  
##  1st Qu.: 400.5   1st Qu.: 7.10   1st Qu.:0.3900   1st Qu.:0.090  
##  Median : 800.0   Median : 7.90   Median :0.5200   Median :0.260  
##  Mean   : 800.0   Mean   : 8.32   Mean   :0.5278   Mean   :0.271  
##  3rd Qu.:1199.5   3rd Qu.: 9.20   3rd Qu.:0.6400   3rd Qu.:0.420  
##  Max.   :1599.0   Max.   :15.90   Max.   :1.5800   Max.   :1.000  
##  residual.sugar     chlorides       free.sulfur.dioxide
##  Min.   : 0.900   Min.   :0.01200   Min.   : 1.00      
##  1st Qu.: 1.900   1st Qu.:0.07000   1st Qu.: 7.00      
##  Median : 2.200   Median :0.07900   Median :14.00      
##  Mean   : 2.539   Mean   :0.08747   Mean   :15.87      
##  3rd Qu.: 2.600   3rd Qu.:0.09000   3rd Qu.:21.00      
##  Max.   :15.500   Max.   :0.61100   Max.   :72.00      
##  total.sulfur.dioxide    density             pH          sulphates     
##  Min.   :  6.00       Min.   :0.9901   Min.   :2.740   Min.   :0.3300  
##  1st Qu.: 22.00       1st Qu.:0.9956   1st Qu.:3.210   1st Qu.:0.5500  
##  Median : 38.00       Median :0.9968   Median :3.310   Median :0.6200  
##  Mean   : 46.47       Mean   :0.9967   Mean   :3.311   Mean   :0.6581  
##  3rd Qu.: 62.00       3rd Qu.:0.9978   3rd Qu.:3.400   3rd Qu.:0.7300  
##  Max.   :289.00       Max.   :1.0037   Max.   :4.010   Max.   :2.0000  
##     alcohol         quality     
##  Min.   : 8.40   Min.   :3.000  
##  1st Qu.: 9.50   1st Qu.:5.000  
##  Median :10.20   Median :6.000  
##  Mean   :10.42   Mean   :5.636  
##  3rd Qu.:11.10   3rd Qu.:6.000  
##  Max.   :14.90   Max.   :8.000
## 'data.frame':    1599 obs. of  13 variables:
##  $ X                   : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ fixed.acidity       : num  7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
##  $ volatile.acidity    : num  0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
##  $ citric.acid         : num  0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
##  $ residual.sugar      : num  1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
##  $ chlorides           : num  0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
##  $ free.sulfur.dioxide : num  11 25 15 17 11 13 15 15 9 17 ...
##  $ total.sulfur.dioxide: num  34 67 54 60 34 40 59 21 18 102 ...
##  $ density             : num  0.998 0.997 0.997 0.998 0.998 ...
##  $ pH                  : num  3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
##  $ sulphates           : num  0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
##  $ alcohol             : num  9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
##  $ quality             : int  5 5 5 6 5 5 5 7 7 5 ...
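A listing like the one above can be produced with read.csv(), summary() and str(). The sketch below is only an illustration under stated assumptions, not the report's actual code: the real file name is not given here, so it writes a three-row stand-in file (values copied from the first rows of the listing) to a temporary path. It also assumes the index column X arises because the first header field of the file is empty, which read.csv() renames to X.

```r
# Stand-in for the real CSV (hypothetical content, only 3 rows and 4 columns).
csv_text <- c(
  '"","fixed.acidity","volatile.acidity","quality"',
  '"1",7.4,0.7,5',
  '"2",7.8,0.88,5',
  '"3",7.8,0.76,5'
)
path <- tempfile(fileext = ".csv")
writeLines(csv_text, path)

# read.csv() turns the empty first header field into a column named "X".
rw <- read.csv(path)

summary(rw)  # per-variable minimum, quartiles, mean and maximum
str(rw)      # number of observations, variable names and types
```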

Our data set consists of 13 variables and 1599 entries. A user_feedback variable is then added to improve the readability of the plots, which brings the total to 14 variables.

User Rating Plot : Quantitative & Qualitative

The entries in the quality variable of the dataset were mapped to create a new variable named user_feedback for qualitative analysis. A rating of 0-4 was treated as Bad, 5-6 as Medium and the rest as Good. Below are the plots :

The majority of the wine users rated the wine 5 or 6. Very few rated it below 5, and few rated it above 6. The qualitative plot reflects the same readings but in a more readable form. Most users rated the wine Medium, few rated it Good and very few rated it Bad.
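The mapping from quality to user_feedback can be sketched with cut(). This is a minimal illustration using the first ten quality scores from the structure listing. The upper break of 10 is an assumption (the highest score in the data is 8), and the report's own code may differ.

```r
# First ten quality scores, copied from the str() listing.
quality <- c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)

# 0-4 -> Bad, 5-6 -> Medium, 7 and above -> Good.
# The upper bound of 10 is an assumption, not a value from the report.
user_feedback <- cut(quality,
                     breaks = c(0, 4, 6, 10),
                     labels = c("Bad", "Medium", "Good"),
                     include.lowest = TRUE)

table(quality)        # quantitative view: counts per score
table(user_feedback)  # qualitative view: counts per category
```

In these ten rows there are eight Medium and two Good samples and no Bad ones. The full data set would of course give different counts.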

Acidity Plot : Fixed and Volatile

As per the composition specified at http://waterhouse.ucdavis.edu/whats-in-wine/red-wine-composition, the fixed acidity of red wine should be 6000 mg/L, which is 6 g/dm^3 (1 L = 1 dm^3). The median of the fixed acidity plot is 7.90 g/dm^3 and the mode seems to be at about 7 g/dm^3. Though these values are larger than the value given on the website, they are close to it. The volatile acidity of red wine should be 600 mg/L, which is 0.6 g/dm^3. The median of volatile acidity is 0.52 g/dm^3 and the mode seems to be at around 0.7 g/dm^3.

Total Acidity Plot : Tartaric + Acetic + Citric

We have created a new variable called total.acidity by adding the 3 acid components present in our dataset.

The plot shows that the largest number of samples have a total acidity of around 8 g/dm^3, so the mode is around 8. The median is 8.72 and the mean is 9.12.
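As a minimal sketch, the derived variable is a row-wise sum of the three acid columns. The data frame below is a toy built from the first four rows of the structure listing. The object name rw is taken from the correlation output later in the report.

```r
# First four rows of the three acid columns, copied from the str() listing.
rw <- data.frame(
  fixed.acidity    = c(7.4, 7.8, 7.8, 11.2),
  volatile.acidity = c(0.70, 0.88, 0.76, 0.28),
  citric.acid      = c(0.00, 0.00, 0.04, 0.56)
)

# Row-wise sum of the three acid components (all in g/dm^3).
rw$total.acidity <- rw$fixed.acidity + rw$volatile.acidity + rw$citric.acid
rw$total.acidity

# Consistency check: the mean of a sum equals the sum of the means,
# so the three means from the summary should add up to about 9.12.
sum_of_means <- 8.32 + 0.5278 + 0.271
sum_of_means
```

The three summary means add up to 9.1188, which agrees with the reported mean of 9.12.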

Chlorides Plot

The plot shows that the largest number of samples have a chloride content of around 0.08 g/dm^3, so the mode is around 0.08. The median is 0.079 and the mean is 0.087.

Total Sulfur Dioxide and Sulphates Plot

The Total Sulfur Dioxide plot shows that the largest number of samples have a total sulfur dioxide content of about 25. As per the summary, the median is 38 and the mean is 46. The high-value outliers, which were removed from the plot in our code but are still part of the summary, would be the reason why the mean and median are higher than the mode.
The Sulphates plot shows that the largest number of samples have a sulphate content of about 0.6. As per the summary, the median is 0.62 and the mean is 0.66.

Sugar Content Plot

The general threshold for the perception of sweetness, as per http://www.jancisrobinson.com/articles/wines-that-are-medium, is 2 g/L, which is 2 g/dm^3. The mean, median and mode all seem to be around 2 g/dm^3. Most of the entries have an appropriate sugar content, but a few wine samples have a very high sugar content (outliers have been removed from the plot). Values go as high as 15.5 g/dm^3, which is too high for a red wine. Diabetes alert!
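To illustrate how a few large values pull the mean upward while the median barely moves, here is a small sketch that trims values above a quantile cutoff. The 90th-percentile cutoff is an assumption for this toy example. The report does not state which cutoff was used for its plots.

```r
# First ten residual.sugar values, copied from the str() listing.
sugar <- c(1.9, 2.6, 2.3, 1.9, 1.9, 1.8, 1.6, 1.2, 2.0, 6.1)

# Hypothetical cutoff: keep values at or below the 90th percentile.
cutoff  <- quantile(sugar, 0.90)
trimmed <- sugar[sugar <= cutoff]

# Compare the statistics with and without the large value (6.1).
c(mean_all     = mean(sugar),
  mean_trimmed = mean(trimmed),
  median_all   = median(sugar))
```

Dropping the single high value lowers the mean noticeably, while the median of the full sample already sits close to the bulk of the data.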

Density Plot

Most of the entries have a density between 0.99 and 1 g/cm^3.

pH Plot

A solution with pH of less than 7 is acidic and greater than 7 is basic. Let us analyse the pH of the entries in the data set.

As expected all the entries fall into the acidic range as Red wine is acidic in nature. The mean and median of the sample lie around 3.3; pH of 3.3-3.6 is ideal for red wine.

Alcohol Content Plot

As per http://www.winecompanion.com.au/wine-essentials/wine-education/alcohol-content-in-red-wines, a red wine with alcohol in excess of 14.5% (alcohol by volume) can be said to be high in alcohol. It seems that only a few samples have an alcohol content above 14.5%, so the others won't go high on drinking red wine. The mean and median lie at around 10% by volume.

Bi-Variate Plots Section

In this section, we analyse the variables with respect to other variables. Here we analyse most of the variables with respect to user_feedback, which shows the effect of each variable on user_feedback. In all our Bi-Variate plots, the variable being tested is the X parameter, the index variable 'X' is the Y parameter, and user_feedback is mapped to the colour.
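The plot layout described above can be sketched as follows. This is an assumption-laden illustration: it uses base R graphics in place of whatever plotting code the report used, a toy data frame built from the first ten rows of the structure listing, and hypothetical colour choices.

```r
# Toy data: first ten rows of X, alcohol and quality from the str() listing.
rw <- data.frame(
  X       = 1:10,
  alcohol = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  quality = c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)
)
rw$user_feedback <- cut(rw$quality, breaks = c(0, 4, 6, 10),
                        labels = c("Bad", "Medium", "Good"),
                        include.lowest = TRUE)

# Hypothetical colour per feedback category.
feedback_cols <- c(Bad = "red", Medium = "grey50", Good = "darkgreen")
point_cols <- unname(feedback_cols[as.character(rw$user_feedback)])

# Tested variable on the x axis, row index X on the y axis,
# colour given by user_feedback.
plot(rw$alcohol, rw$X, col = point_cols, pch = 19,
     xlab = "alcohol (% by volume)", ylab = "X (row index)")
legend("bottomright", legend = names(feedback_cols),
       col = feedback_cols, pch = 19)
```

Because the row index carries no information of its own, it only spreads the points vertically so that the horizontal clustering of each colour is easier to see.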

Correlation Plot

Let us examine the correlation between every pair of variables using ggcorr. The plot is as shown below :

Those pairs with a colour shade closer to the shade corresponding to +1 or -1 have a stronger correlation. Those closer to +1 have a positive correlation (with increase in value of one variable other variable also increases) and those closer to -1 have a negative correlation (with increase in value of one variable other variable decreases).
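The numbers behind such a plot are a matrix of pairwise correlation coefficients. As a rough sketch, the same kind of matrix can be computed with base R's cor(), used here in place of ggcorr (from the GGally package) so that the example needs no extra packages. The data frame is a toy built from the first ten rows of four columns in the structure listing, so its coefficients are not those of the full data set.

```r
# Toy data: first ten rows of four columns from the str() listing.
rw <- data.frame(
  fixed.acidity    = c(7.4, 7.8, 7.8, 11.2, 7.4, 7.4, 7.9, 7.3, 7.8, 7.5),
  volatile.acidity = c(0.7, 0.88, 0.76, 0.28, 0.7, 0.66, 0.6, 0.65, 0.58, 0.5),
  pH               = c(3.51, 3.2, 3.26, 3.16, 3.51, 3.51, 3.3, 3.39, 3.36, 3.35),
  alcohol          = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5)
)

# Pairwise Pearson correlations (the default method of cor()).
cor_matrix <- cor(rw)
round(cor_matrix, 2)
```

Each cell lies between -1 and +1, the diagonal is exactly 1, and the matrix is symmetric, which is why a correlation plot only needs to show one triangle.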

Effect of fixed and volatile acidity on the User Feedback

In the fixed acidity plot, we see that most of the wine samples with a medium rating are densely populated in the region between 6.0 and 9.0 g/dm^3. Samples rated good are sparsely populated. In spite of the sparse population, they seem to have a comparatively higher count in the acidity range of 7.0 to 9.0 g/dm^3.
In the volatile acidity plot, we see that most of the wine samples with a medium rating are densely populated in the acidity range of 0.35 to 0.75 g/dm^3. Samples with a good rating, though fewer, are seen more in the acidity range of 0.25 to 0.40 g/dm^3.

Effect of total acidity on the User Feedback

We obtain the total acidity by adding the contents of Tartaric acid (fixed acidity), Acetic acid (volatile acidity) and Citric acid.

Medium rated samples are more clustered and densely populated in the total acidity range of 6.5 to 9.5 g/dm^3. Good rated samples appear scattered all over. They seem to be more in the acidity range of 7.75 g/dm^3 to 10.0 g/dm^3.

Effect of pH on the User Feedback

We know that the greater the pH value, the lower the acidity. It makes more sense to work with the inverse pH value, which rises with acidity, so that we can draw some correlated results.

Here we can see that most of the samples have an inverse pH in the mid region, i.e. between 0.28 and 0.32. Bad samples appear scattered more to the left of 0.31.
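For reference, the inverse-pH transform and the way its range maps back to ordinary pH can be sketched as below. The pH values are the first five from the structure listing. Nothing here is taken from the report's own code.

```r
# First five pH values, copied from the str() listing.
pH <- c(3.51, 3.20, 3.26, 3.16, 3.51)

# Inverse pH: larger values mean a lower pH, i.e. a more acidic sample.
inverse_pH <- 1 / pH
round(inverse_pH, 3)

# Map the inverse-pH range 0.28 to 0.32 back to the pH scale.
pH_bounds <- 1 / c(0.28, 0.32)
round(pH_bounds, 2)
```

An inverse pH of 0.28 to 0.32 therefore corresponds to a pH of roughly 3.57 down to 3.13.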

Effect of Sugar and Chloride content on the User Feedback

Most of the medium rated samples in the Sugar plot are densely populated in the range of 1.5 to 3.0 g/dm^3. Good rated samples appear more densely in the region between 1.5 and 2.75 g/dm^3 of residual sugar.
In the Chloride plot, the medium rated samples appear to be clustered in the region between 0.06 and 0.125 g/dm^3 of chloride. Good rated samples appear in the region of 0.06 to 0.08 g/dm^3.

Effect of Sulphur Dioxide content on the User Feedback

Most of the medium rated samples in the Sulfur dioxide plot are densely populated in the range of 10 to 60 mg/dm^3. Good rated samples are distributed evenly throughout the distribution, mostly from 10 to 30 mg/dm^3.

Effect of Density on the User Feedback

Medium rated wine is more populated in the density range of 0.9945 to 0.999 g/cm^3. Good samples are more populated in the range of 0.9940 to 0.9975 g/cm^3.

Effect of Alcohol content on the User Feedback

Those samples with medium rating appear more in the alcohol range of 9.0 to 11.5 % by volume. Good samples are more populated in the alcohol range of 10.5 to 12.5 % by volume.
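Ranges read off a plot can be cross-checked with a per-group summary. The sketch below uses tapply() on a toy sample (the first ten rows of alcohol and quality from the structure listing). The choice of the median as the summary is an assumption, and the toy values will not match the full data set.

```r
# First ten rows of alcohol and quality, copied from the str() listing.
alcohol <- c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5)
quality <- c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)

user_feedback <- cut(quality, breaks = c(0, 4, 6, 10),
                     labels = c("Bad", "Medium", "Good"),
                     include.lowest = TRUE)

# Median alcohol per feedback group; a group with no samples yields NA.
group_median <- tapply(alcohol, user_feedback, median)
group_median
```

The same call with quantile() in place of median() would give per-group quartiles, which can back up ranges such as the ones quoted above.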

Multi-Variate Plots Section

In this section we analyse multiple variables together to see the effect of one variable on another.

Effect of Acidity and pH on the User Feedback

The plot depicts a clear relationship between acidity and inverse pH. The samples appear most clustered in the acidity range of 6.5 to 11.0 g/dm^3 and the inverse pH range of 0.275 to 0.325. They do not seem to be related to the feedback the user gives.
Now let us look at the correlation result for total acidity and inverse pH.

## 
##  Pearson's product-moment correlation
## 
## data:  rw$total.acidity and 1/rw$pH
## t = 37.795, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.6603527 0.7121633
## sample estimates:
##       cor 
## 0.6871306

A correlation of 0.687, with a p-value below 2.2e-16, is sufficient to confirm that the two variables are positively correlated.
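The t statistic in the output follows directly from the correlation coefficient and the degrees of freedom, t = r * sqrt(df / (1 - r^2)). The short check below plugs in the printed values. It is a worked verification, not part of the report's original code.

```r
# Values copied from the cor.test() output.
r  <- 0.6871306   # sample correlation
df <- 1597        # degrees of freedom, n - 2 with n = 1599

# t statistic for testing that the true correlation is zero.
t_stat <- r * sqrt(df / (1 - r^2))
t_stat

# Share of variance the two variables have in common.
r_squared <- r^2
r_squared
```

The result reproduces the reported t of about 37.795, and r^2 of about 0.47 says that a little under half of the variance is shared between total acidity and inverse pH.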

Effect of Density and Alcohol on the User Feedback

This plot would be very helpful in predicting the User Feedback. Here we observe that medium rated samples appear most in the density range of 0.9955 to 0.9985 g/cm^3 and the alcohol range of 9.0 to 10.5 % by volume. Good rated samples are more populated in the density range of 0.994 to 0.997 g/cm^3 and in the alcohol range of 11.0 to 12.5 % by volume.
Let us have a look at the plot obtained using geom_smooth().

Let us now divide the above plot into 3 different plots based on the feedback users give.

Good rated samples seem to follow a smooth path, which implies that the data is well distributed along the path. Medium rated samples also seem to follow a smooth path, but it falls and then rises at a density of 0.9975. Bad rated samples do not follow a smooth path; the path is very bumpy.
Now let us see whether Density and Alcohol are correlated :

## 
##  Pearson's product-moment correlation
## 
## data:  rw$density and rw$alcohol
## t = -22.838, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.5322547 -0.4583061
## sample estimates:
##        cor 
## -0.4961798

The correlation value of -0.496 implies that they are moderately negatively correlated: as one parameter increases, the other tends to decrease.
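The 95 percent confidence interval in the output can be reproduced from r and the sample size with the Fisher z transformation. The sketch below is a worked check on the printed numbers, with n = 1599 taken from df + 2.

```r
# Values copied from the cor.test() output.
r <- -0.4961798
n <- 1599          # df + 2

# Fisher z transformation: atanh(r) is approximately normal
# with standard error 1 / sqrt(n - 3).
z  <- atanh(r)
se <- 1 / sqrt(n - 3)

# 95 percent interval on the z scale, transformed back with tanh().
ci <- tanh(z + c(-1, 1) * qnorm(0.975) * se)
ci
```

The interval comes out at about -0.532 to -0.458, matching the output. Since it lies well away from zero, the negative relationship is not a sampling accident.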

Effect of Chloride and Sulphur Dioxide level on User Feedback

The plot shows that almost all the medium rated samples appear in the chloride range of 0.05 to 0.125 g/dm^3 and the sulphur dioxide range of 0 to 150 mg/dm^3. Bad samples seem to have larger chloride values and smaller sulphur dioxide values, whereas good samples seem to have smaller chloride values and comparatively larger sulphur dioxide values.

Effect of pH and Alcohol on the User Feedback

This plot tells us how pH and Alcohol can together affect the User Feedback. We can see that medium rated samples mostly fall in the 10^pH range of 1000 to 3000 and the alcohol range of 9.0 to 11.0 % by volume. Good samples seem to appear at higher alcohol levels for the same pH range. Bad samples seem to appear in the same alcohol range at a higher 10^pH value.
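The 10^pH scale stretches small pH differences, and its range can be mapped back to pH with log10(). A brief sketch, using three illustrative pH values that are not taken from the data:

```r
# Illustrative pH values (hypothetical, chosen to span the plotted range).
pH <- c(3.00, 3.20, 3.51)

# The transform used on the plot axis.
ten_pow_pH <- 10^pH
round(ten_pow_pH)

# Map the 10^pH range 1000 to 3000 back to the pH scale.
pH_range <- log10(c(1000, 3000))
round(pH_range, 2)
```

A 10^pH range of 1000 to 3000 corresponds to a pH of 3.00 to about 3.48, so a higher 10^pH value means a less acidic sample.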

Final Plots and Summary

Plot One

The above plot shows how we used an existing variable to create a new one that is easier to understand. We used the quality variable to create the user_feedback variable.
The entries in the quality variable of the dataset were mapped to create a new variable named user_feedback for qualitative analysis. A rating of 0-4 was treated as Bad, 5-6 as Medium and the rest as Good.
The majority of the wine users rated the wine 5 or 6. Very few rated it below 5, and few rated it above 6. The qualitative plot reflects the same readings but in a more readable form. Most users rated the wine Medium, few rated it Good and very few rated it Bad.

Plot Two

This plot would be very helpful in predicting the User Feedback. Here we observe that medium rated samples appear most in the density range of 0.9955 to 0.9985 g/cm^3 and the alcohol range of 9.0 to 10.5 % by volume. Good rated samples are more populated in the density range of 0.994 to 0.997 g/cm^3 and in the alcohol range of 11.0 to 12.5 % by volume. The smooth curve shows us that the samples are distributed evenly along its course.

Plot Three

The above plot shows the effect of the pH of the wine and its alcohol content on the User Feedback. It tells us how pH and Alcohol can together affect the User Feedback. We can see that medium rated samples mostly fall in the 10^pH range of 1000 to 3000 and the alcohol range of 9.0 to 11.0 % by volume. Good samples seem to appear at higher alcohol levels for the same pH range. Bad samples seem to appear in the same alcohol range at a higher 10^pH value.

Reflection

The main intention of this project was to determine why a sample of red wine was rated as Good, Medium or Bad. For this purpose I developed Uni, Bi and Multi-Variate plots to analyse the dataset. We analysed the dataset stage by stage to reach the final results: first Uni-Variate plots, then Bi-Variate plots and lastly Multi-Variate plots.
There were a number of difficulties while analysing the data. The first was choosing appropriate parameters for a plot that would give the required results, or at least give some insight into the data. The second was figuring out the distribution where the points from all types of feedback were mixed up.
I found success after resolving each difficulty. Choosing appropriate points for the plot and resolving the mixed-up clusters of points were my successes. The data set given to me was good, but not sufficient for a complete and perfect analysis. A few more variables could be added, such as the type of grapes used, the region the wine comes from and how old the wine is.